The purpose of this study was to compare the scoring 
of Teacher Performance Assessment Instruments (TPAI) indicators using 
discrete descriptors when some are considered "essential" with the 
scoring of these same indicators when no descriptors are 
considered essential. The two questions addressed in this study were: 
(1) To what extent does the use of essential descriptors affect the 
overall "pass-fail rate" for each competency? and (2) To what extent 
does the use of essential descriptors affect the dependability of the 
certification decision? Data were used from twenty-six teachers who 
volunteered to prepare a lesson plan portfolio and who allowed 
observers to come into their classes. Results of observations were 
scored using two different methods: (1) using essential descriptors 
according to the criteria anticipated when the instruments are used 
in certification; and (2) treating descriptors equally, with no 
essential designation. Analyses were conducted on the transformed 
data obtained using the essential and non-essential scoring systems. 
Results showed that the essential descriptors did not detract from 
the reliability of the measures; in fact, reliability was enhanced by them. 
Although the study requires replication with a larger number of 
teachers and more realistic conditions, the results were viewed as 
supportive of the essential descriptor scoring method. (LMO) 

THE INFLUENCE OF SCORING PROCEDURES ON ASSESSMENT 
DECISIONS AND THEIR RELIABILITY 



Combining observation data to make evaluation decisions is a 
difficult and troublesome process when the performances to be 
evaluated are as complex as teaching. Typically, there are a 
number of dimensions included in the assessment, and each may be 
observed by a number of different individuals on a number of 
different occasions. Little of the work in personnel psychology 
speaks to the problem, since uses of observation data in that 
field are far different from the problems of licensure and 
certification. A performance profile may be created as part of a 
periodic evaluation of personnel with no need for cut-offs, per 
se. And, when personnel evaluations are included in a promotion 
process, they are only one criterion among many with no scoring 
formula. The need for a scoring formula arises when there is a 
large set of candidates, each of whom must be screened with regard 
to specified criteria. 

Teacher licensure, as in the professional certification 
process in Georgia, and merit certification in Florida are 
examples where standardized scoring procedures are necessary. The 
kinds of measurements to be made are certainly a key factor in the 
scoring. In Florida's Performance Measurement System, for 
example, observers tally behaviors at each occurrence. Large 
numbers of desirable behaviors are considered to constitute an 
effective performance. Consequently, the scoring system provides 
for summing behaviors and awarding "quality points" in proportion 
to the number of instances observed. These points are summed, in 
turn, to create a grand sum which is the teacher's score. The 
score is derived without regard for the overall profile of 
performance or particular areas of strength or weakness. 

Scoring in Georgia's certification program has been based on 
a different set of assumptions. First, and most important, 
teachers must demonstrate satisfactory performances in a number of 
subareas. As a result there is a limit to the degree to which one 
kind of behavior can compensate for another. Initially, the 
Teacher Performance Assessment Instruments (TPAI) were constructed 
of Behaviorally Anchored Rating Scales (BARS) which described five 
gradations of performance. 

In this original system, the TPAI consisted of a set of broad 
teaching competencies which were defined by second-order 
descriptions of behaviors called indicators. The indicators were 
sentence-length statements scored on a scale from 1 to 5. In order 
to determine or assign a score to each indicator, indicators were 
defined by third-order descriptions of behavior called 
descriptors. Each descriptor was assigned a scale point value. 
After observing an appropriate sample of teaching performance, 
observers selected the descriptor scale point best representing 
the teaching performance observed for each indicator. 

As use of the TPAI increased, some of the problems inherent in 
BARS emerged. Most notably, the end-points were fairly distinct 
but the mid-points tended to be somewhat less clear. Inasmuch as 
the basic scoring purpose of the TPAI indicator was to reduce the 
performance to a single satisfactory/not satisfactory decision, 
other methodologies were explored from the outset. The most 
successful was a set of discrete descriptors, as distinct from the 
hierarchical descriptors in BARS. A sample of an indicator and 
its discrete descriptors is shown in Figure 1. Observers made a 
dichotomous decision about each descriptor and then the scoring 
rules for the indicator were used to translate these to a single 
decision about the adequacy of the indicator. 

The discrete descriptors helped to avoid some of the 
ambiguity in the BARS, since each descriptor is a distinct 
statement. However, they also carried some limitations in 
scoring. Since the four descriptors were considered equal, any 
one of them could be unsatisfactory and the teacher's performance 
on the indicator could still be satisfactory. This tension 
between the desirability of independent, clearly stated 
descriptors and the desirability of the weighting implied in the 
hierarchical BARS was not resolved in the current edition of the 
TPAI. 

When the revised TPAI was planned, there was an effort to 
improve clarity and reduce ambiguity throughout. These efforts 
led to the decision to eliminate the hierarchical BARS and replace 
them with discrete descriptors. However, the scoring system was 
modified substantially when certain descriptors were dubbed 
"essential." An indicator could not be scored acceptable unless 
all of its essential descriptors were scored acceptable. A sample 
of a revised indicator is included in Figure 2. This revised 
scoring methodology represented a combination of the desirable 
attributes of the hierarchical and discrete descriptor formats. 

There was some expectation that the use of essential 
descriptors would make the scoring more difficult for teachers 
and, at the same time, reduce reliability somewhat. However, the 
magnitude of these effects could not be anticipated without 
performance data. The field-test of a preliminary version of the 
revised TPAI provided an opportunity to investigate the 
psychometric properties of the TPAI when indicators were scored 
with essential descriptors. 

PURPOSE 

The purpose of this study was to compare the scoring of TPAI 
indicators using discrete descriptors when some are considered 
"essential" with the scoring of these same indicators when no 
descriptors are considered essential. The two questions addressed 
in this study were: (1) To what extent does the use of essential 
descriptors affect the overall "pass-fail rate" for each 
competency? and (2) To what extent does the use of essential 
descriptors affect the dependability of the certification 
decision? 

PROCEDURES 

Data from twenty-six teachers were used in the analyses. The 
teachers were volunteers who agreed to prepare a lesson plan 
portfolio and have observers come into their classes. Each 
teacher was observed by four observers: an administrator, a peer, 
and two representatives from a Regional Assessment Center, each of 
whom observed independently. 

The observation data consisted of 104 sets (26 teachers x 4 
observers) of pass/fail decisions for each of the 142 descriptors 
in the revised TPAI. The results of these observations were 
scored using two different methods. The first scoring system was 
based on the use of essential descriptors, according to the 
criteria anticipated when the instruments are used in 
certification. 



In this system, if all of the descriptors keyed as essential 
for an indicator received passing scores, then credit could also 
be given for passing scores on descriptors not keyed essential. 
If any of the descriptors keyed as essential for an indicator did 
not receive a passing score, then performance on descriptors not 
keyed essential could not be used to compute the indicator score. 

In order to compute raw indicator scores, the number of 
descriptors receiving passing scores was totalled. These raw 
scores ranged from 1 (no descriptors successfully demonstrated) to 
5 (all four descriptors successfully demonstrated) for each 
indicator, i.e., one more than the number of descriptors credited. 
Figure 3 contains a sample of raw indicator scores for 
a hypothetical competency. 
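
The raw-score rule for the two scoring methods can be sketched in Python. This is a minimal illustration under stated assumptions, not the scoring procedure actually used in the field test: the function name `raw_indicator_score` and its arguments are hypothetical, the example observation is invented, the essential flags follow the tentative markings on descriptors b and c in Figure 2, and the reading that a failed essential descriptor leaves only the essential passes to be counted is one interpretation of the rule stated in the text.

```python
from typing import Sequence


def raw_indicator_score(passed: Sequence[bool],
                        essential: Sequence[bool],
                        use_essential: bool = True) -> int:
    """Return a raw indicator score: 1 (nothing credited) up to 1 + len(passed)."""
    if len(passed) != len(essential):
        raise ValueError("passed and essential must have the same length")
    # True when every descriptor keyed as essential received a passing score.
    essentials_ok = all(p for p, e in zip(passed, essential) if e)
    if use_essential and not essentials_ok:
        # Assumed reading: non-essential descriptors earn no credit here.
        credited = sum(1 for p, e in zip(passed, essential) if p and e)
    else:
        # Every passing descriptor counts (also the non-essential method).
        credited = sum(1 for p in passed if p)
    return 1 + credited


# Hypothetical observation of a four-descriptor indicator (a, b, c, d),
# with b and c keyed as essential; descriptor c was not passed.
passed = [True, True, False, True]
essential = [False, True, True, False]

essential_raw = raw_indicator_score(passed, essential, use_essential=True)
equal_raw = raw_indicator_score(passed, essential, use_essential=False)
print(essential_raw, equal_raw)  # 2 4
```

In this invented case the same observation yields a raw score of 2 under essential scoring and 4 when all descriptors are treated equally, which illustrates how the essential designation can lower an indicator score.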

Next, these raw scores were compared with the minimum 
standard score set for each of the indicators. If an indicator's 
raw score was equal to or greater than the minimum level, that 
indicator was assigned a transformed score of 1 (acceptable). If 
the raw score was less than the minimum level, the indicator was 
assigned a transformed score of 0 (unacceptable). The result was 
a matrix of thirty-five 1's and 0's for each observation of each 
teacher. In Figure 4, the data shown in Figure 3 have been 
transformed to reflect an acceptable/unacceptable decision for 
each indicator. 
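
The transformation from raw to acceptable/unacceptable scores can be sketched as follows. This is an illustrative sketch only: the names `minimum_level`, `raw_scores`, and `transform` are hypothetical, and the values are the sample data for the hypothetical teacher in Figure 3, read on the assumption that the figure's columns are, in order, the recommended minimum level and the RAC, Peer, and Adm raw scores.

```python
# Minimum levels and raw scores as read from Figure 3 (hypothetical teacher).
minimum_level = {1: 3, 2: 4, 3: 3, 4: 4}
raw_scores = {
    "RAC":  {1: 5, 2: 2, 3: 4, 4: 2},
    "Peer": {1: 5, 2: 2, 3: 5, 4: 3},
    "Adm":  {1: 4, 2: 4, 3: 5, 4: 2},
}


def transform(raw: int, minimum: int) -> int:
    """Return 1 (acceptable) if raw meets the minimum level, else 0."""
    return 1 if raw >= minimum else 0


# One list of 0/1 decisions per observer, ordered by indicator number.
transformed = {
    observer: [transform(scores[i], minimum_level[i])
               for i in sorted(minimum_level)]
    for observer, scores in raw_scores.items()
}

print(transformed["RAC"])   # [1, 0, 1, 0]
print(transformed["Peer"])  # [1, 0, 1, 0]
print(transformed["Adm"])   # [1, 1, 1, 0]
```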

In the second method of scoring, all descriptors were treated 
equally, i.e., none of them was designated essential. Raw 
indicator scores consisted of the total number of descriptors 
scored acceptably and ranged from 1 to 5 as in the essential 
scoring system. Transformed scores were determined exactly as 
they were in the essential scoring system. 

ANALYSIS 



Two types of analyses were conducted on the transformed data 
obtained using the essential and non-essential scoring systems. 
Pass Rate 

Each of the twenty-six teachers was scored on all levels of 
the TPAI using both the essential and non-essential scoring 
systems. The portion of indicators scored acceptably by the 
observers was the score. The mean "score" was computed for both 
of the scoring methods. 
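
As a rough illustration of the pass-rate computation, the sketch below takes 0/1 transformed scores, computes the portion scored acceptable for each teacher, and averages across teachers. The teacher labels, the data, and the function name `portion_acceptable` are all hypothetical; the study's own data are not reproduced here.

```python
from statistics import mean


def portion_acceptable(rows):
    """Proportion of all indicator decisions (across observers) scored 1."""
    decisions = [score for row in rows for score in row]
    return sum(decisions) / len(decisions)


# Hypothetical transformed scores: one row per observer, one column per indicator.
teachers = {
    "teacher_a": [[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 0]],
    "teacher_b": [[1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1]],
}

scores = {name: portion_acceptable(rows) for name, rows in teachers.items()}
mean_score = mean(scores.values())
print(round(mean_score, 2))  # 0.75
```

Running the same computation once on the essential-scored matrix and once on the non-essential-scored matrix would give the two mean scores being compared.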
Dependability 

Generalizability theory was used to plan the analyses of the 
field test data. Four facets were identified as important sources 
of variation in the performance data obtained: teachers, 
observers, observer-types, and performance indicators. The 
four-facet design with observers nested within observer-type is 
identical to a three-facet fully crossed design with teachers, 
observer-types, and performance indicators as the sources of 
variation. As a consequence, the simpler three-facet model was 
used in all analyses. 

For each analysis, teachers were treated as the facet of 
differentiation, and observer-type and performance indicators 
within competencies were treated as facets of generalization. All 
facets were regarded as random in the analysis design. 

A dependability coefficient (Φ) was calculated to assess the 
dependability of the data for making judgements about teacher 
performance relative to a standard (λ). According to Brennan (1978): 

$$\Phi(\lambda) = \frac{\sigma^2(T) + (\bar{X} - \lambda)^2 - \sigma^2(\bar{X})}{\sigma^2(T) + (\bar{X} - \lambda)^2 - \sigma^2(\bar{X}) + \sigma^2(\Delta)}$$

where 

$\sigma^2(T)$ = the variance attributable to the facet of differentiation; 
$\sigma^2(\bar{X})$ = the variance component of the observed mean scores; 
$\bar{X}$ = the mean score; 
$\lambda$ = the cutoff score; 
$\sigma^2(\Delta)$ = the variance component due to error. 
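
A small numerical sketch may help show how the index behaves. It computes the ratio of the teacher variance plus the squared distance of the mean from the cutoff (corrected by the variance of the observed mean) to that same quantity plus the error variance. The function name `dependability_index` is hypothetical. The teacher variance (.010) and mean score (.77) are the essential-scoring values reported in this study, but the cutoff, the variance of the mean, and the error variance used below are invented for illustration, since the paper does not report them.

```python
def dependability_index(var_teacher: float, mean_score: float, cutoff: float,
                        var_mean: float, var_error: float) -> float:
    """Estimate Phi(lambda) from variance components and a cutoff score."""
    # Teacher variance plus the squared mean-to-cutoff distance, corrected
    # by the variance of the observed mean.
    signal = var_teacher + (mean_score - cutoff) ** 2 - var_mean
    return signal / (signal + var_error)


# cutoff, var_mean, and var_error are hypothetical illustration values.
phi = dependability_index(var_teacher=0.010, mean_score=0.77,
                          cutoff=0.60, var_mean=0.001, var_error=0.006)
print(round(phi, 2))  # 0.86
```

Because the squared distance between the mean and the cutoff enters both numerator and denominator, the index rises as the cutoff moves away from the mean, even when the variance components are unchanged.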



RESULTS 



The portion of indicators mastered by the twenty-six teachers 
across four observers is shown in Table 1. With no essential 
descriptors, the score was .81 or 81 percent. When some 
descriptors were designated essential, the score dropped to .77 or 
77 percent. 

The variance components generated in the generalizability 
analysis are displayed in Table 2. 






Two coefficients were derived in each generalizability 
analysis: the generalizability coefficient (ρ²) and the 
dependability coefficient (Φ(λ)). With no essential descriptors, 
these values were .65 and .89, respectively. When descriptors were 
designated essential, the two values were .68 and .91, 
respectively. These results are included in Table 3. 
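
The generalizability coefficients can be approximately recovered from the variance components in Table 2. The sketch below assumes the standard formula for a fully crossed, random teacher x observer x indicator design, in which relative error variance is the sum of the teacher-interaction components divided by their sample sizes; the paper does not state the formula it used, so this is an assumption, and the function name is hypothetical. The TO component for non-essential scoring is read as .014.

```python
def generalizability_coefficient(var_t: float, var_to: float, var_ti: float,
                                 var_toi: float, n_observers: int,
                                 n_indicators: int) -> float:
    """rho^2 for a crossed, random teacher x observer x indicator design."""
    # Relative error: only components that interact with teachers contribute.
    relative_error = (var_to / n_observers
                      + var_ti / n_indicators
                      + var_toi / (n_observers * n_indicators))
    return var_t / (var_t + relative_error)


# Variance components from Table 2, with O = 4 observers and I = 35 indicators.
non_essential = generalizability_coefficient(0.009, 0.014, 0.025, 0.085, 4, 35)
essential = generalizability_coefficient(0.010, 0.014, 0.023, 0.101, 4, 35)
print(round(non_essential, 2), round(essential, 2))  # 0.65 0.67
```

The non-essential value matches the reported .65. The essential value computed from the rounded components is about .67 rather than the reported .68, a gap that is plausibly due to rounding of the published variance components.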



DISCUSSION 

Consideration of these results must be made in light of the 
context in which the study was done. The TPAI which was used was 
a preliminary field-test edition which had had limited use prior 
to the study. Furthermore, the observers had had little or no 
training in its use and interpretation. The RAC members had 
participated in reviews of earlier drafts and had a brief 
orientation meeting. The school site observers may have had an 
orientation meeting. Such arrangements were tolerable in 
field-testing since the assessments were followed by extensive 
debriefing to identify problems which required attention in 
instrument revision and/or in training. 

The difference in the mean performance levels using the two 
scoring systems was expected since the essential descriptors were, 
in essence, making the requirements more specific. More than 
likely, the difference will diminish as the instrument is used for 
certification and teachers attend to the criteria more 
systematically. 

The coefficients computed were not analogous to those 
computed in field tests of the current TPAI. Those analyses, 
which yielded ρ² values in excess of .8, were different in three 
important ways. First, the raw scores were used in the analyses 
and they had a range of 1-5, whereas these unacceptable/acceptable 
scores can only be 0 or 1. Second, these earlier studies used a 
design that included an occasion facet, and both it and the 
indicators were considered fixed. Finally, the earlier studies 
involved a more refined instrument, trained observers, and 
beginning teachers. 

All of these factors would tend to increase teacher variance 
or decrease error variance in the earlier studies and, therefore, 
increase the magnitude of the generalizability coefficient. In 
light of these factors, the ρ² values near .7 were considered very 
encouraging. 

To some extent the values of the coefficients may have been 
suppressed by a lack of homogeneity in the indicators. The 
relatively large variance component associated with the indicators 
supports this interpretation. 

The two scoring procedures were not equally reliable, but the 
results were not in the direction that had been anticipated. 
Because disagreement between observers on a single essential 
descriptor could change the computed indicator score, 
it was anticipated that essential descriptors would diminish 
reliability. Undoubtedly there were disagreements between 
observers on decisions about essential descriptors. However, this 
type of error was offset by increased variance in teacher 
performance that was associated with the essential descriptor 
scoring system. This conclusion can be supported by comparing the 
relative magnitude of the Teacher and Teacher by Observer by 
Indicator variance components in Table 2. This is a situation 
where objectivity and the ability to differentiate teachers can be 
seen to be quite different concerns. 

The increased magnitude of the reliability coefficients may 
be an indication of the validity of the "essentialness" of the 
essential descriptors. To the extent that the coefficients 
measure internal consistency, the higher values associated with 
essential descriptors suggest a better measure of the construct of 
"teaching" when this scoring system is used. However, establishing 
validity will await a more appropriate study. 

CONCLUSIONS 

The generalizability coefficient of .68 was somewhat lower 
than the analogous value associated with the current TPAI. 
However, the result is considered satisfactory in light of the 
tentative instrument used, the lack of training, and the less 
conservative analysis employed in previous studies. 

The essential descriptors did not detract from the 
reliability of the measures; in fact, reliability was enhanced by them. 
This surprising finding was viewed with considerable relief and a 
good bit of caution given the uncertain stability of variance 
components (Tobin and Capie, 1981). Although the study will 
require replication with a larger number of teachers and more 
realistic conditions, the results are viewed as supportive of the 
essential descriptor scoring method. 

Indicator 35: Manages disruptive behavior among learners. 

Descriptors 

a. Behavior of the entire class is monitored throughout the 
lesson. 

b. Learners do not interfere with the work of others or interact 
inappropriately often or for an extended period. 

c. Learners who interact inappropriately or otherwise interfere 
with the work of others are identified and dealt with quickly 
or no learners interfere with instruction. 

d. Learners who interact inappropriately or otherwise interfere 
with the work of others are identified and dealt with 
appropriately (e.g., firmly, with suitable consequences for 
situation, effectively, etc.) or no learners interfere 
with instruction. 



Figure 1. Sample indicator and discrete descriptors. 

Indicator 27: Implements activities in a logical sequence. 

Descriptors 

 a. Lesson is initiated with an interesting introduction. 

*b. Necessary lesson components are addressed. 

*c. Lesson components are sequenced to provide a logical 
    development of lesson content. 

 d. Lesson is closed appropriately. 

*Tentative recommendations for essential descriptors 

Figure 2. Revised indicator and essential descriptors. 

Table 1 

Portion of Indicators Mastered 
(I=35) 

Scoring System               Score 

No Essential Descriptors     .81 
Essential Descriptors        .77 



Table 2 

Variance Components for Different Scoring Systems 
(T=26, O=4, I=35) 

Source           Non-essential    Essential 
                 Scoring          Scoring 

Teacher (T)      .009             .010 
Observer (O)     .001             .001 
Indicator (I)    .321             .028 
TO               .014             .014 
TI               .025             .023 
OI               .001             .001 
TOI              .085             .101 

Table 3 

Reliability Coefficients for Two Different Scoring Systems 
(T=26, O=4, I=35) 

Scoring System               ρ²      Φ(λ) 

No Essential Descriptors     .65     .89 
Essential Descriptors        .68     .91 

Sample Competency: Teacher Miss Hypothetical 

Indicator    Recommended        Raw Score 
             Minimum Level      RAC    Peer    Adm 

1            3                  5      5       4 
2            4                  2      2       4 
3            3                  4      5       5 
4            4                  2      3       2 

Figure 3. Sample set of raw indicator scores. 



Sample Competency: Teacher Miss Hypothetical 

Indicator    Transformed Score 
             RAC    Peer    Adm 

1            1      1       1 
2            0      0       1 
3            1      1       1 
4            0      0       0 

Figure 4. Sample set of transformed indicator scores. 